Skip to content

ossl: improve error queue hygiene#267

Merged
pgellert merged 5 commits intoredpanda-data:v26.1.xfrom
pgellert:ossl/error-queue-hygiene
Apr 24, 2026
Merged

ossl: improve error queue hygiene#267
pgellert merged 5 commits intoredpanda-data:v26.1.xfrom
pgellert:ossl/error-queue-hygiene

Conversation

@pgellert
Copy link
Copy Markdown

@pgellert pgellert commented Apr 10, 2026

OpenSSL uses a per-thread error queue to communicate failure details. In seastar's cooperative scheduling model, multiple TLS sessions share the same thread. If stale errors are left on the queue across a scheduling point, SSL_get_error on an
unrelated session can misclassify the result — e.g. reporting SSL_ERROR_SYSCALL instead of SSL_ERROR_SSL — because SSL_get_error peeks at the oldest error on the queue to decide the classification.

We have hit this in practice (e1625c8, cd02ecc) and it resulted in hard-to-debug customer issues. This PR makes the following changes to harden the error queue handling. See the commit messages for further details.

Drain stale errors on error paths:

Drain stale errors after successful operations:

  • ossl: clear error queue after successful SSL_CTX_new — SSL_CTX_new can return success while leaving errors from partially-failed system config parsing
  • ossl: clear error queue after successful SSL operations — SSL_shutdown, SSL_do_handshake, SSL_write_ex, and SSL_read_ex can return success while leaving errors from internal BIO write failures that OpenSSL absorbed. Root cause is a return-value
    translation gap in bio_write_intern where our bwrite callback's 0-means-failure is misclassified as success by tls_retry_write_records. See commit message for detailed call chains.

Assert queue is clean:

Fixes https://redpandadata.atlassian.net/browse/CORE-15655

@pgellert pgellert requested review from a team, ballard26, dotnwat, rockwotj and travisdowns April 10, 2026 14:27
@pgellert pgellert self-assigned this Apr 10, 2026
@pgellert pgellert requested review from Copilot and nguyen-andrew and removed request for a team April 10, 2026 14:27
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves OpenSSL per-thread error queue hygiene in Seastar’s TLS/OpenSSL backend to prevent stale errors from leaking across cooperative scheduling points and causing misclassification in SSL_get_error().

Changes:

  • Adds make_unknown_ossl_error() to surface and drain the OpenSSL error queue for unexpected conditions.
  • Switches BIO allocation failure to throw make_ossl_error() for richer diagnostics.
  • Introduces verify_clean_error_queue() checks before key SSL operations and drains/logs leftover errors after SSL_CTX_new().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/net/ossl.cc
Comment on lines +1844 to +1845
tls_log.warn("{} stale error on queue before {}: {}", *this, operation, buf);
SEASTAR_ASSERT(0 && "stale errors on OpenSSL error queue");
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verify_clean_error_queue() calls SEASTAR_ASSERT unconditionally when the thread error queue is non-empty. SEASTAR_ASSERT is active in all build modes (include/seastar/util/assert.hh:29), so any stale OpenSSL error (including ones OpenSSL may legitimately leave behind on success) will abort the process. Consider draining/clearing the error queue here to preserve the intended hygiene in production, and gate the hard assert behind a debug-only macro if you still want a fail-fast diagnostic.

Suggested change
tls_log.warn("{} stale error on queue before {}: {}", *this, operation, buf);
SEASTAR_ASSERT(0 && "stale errors on OpenSSL error queue");
size_t cleared = 0;
while (ERR_get_error() != 0) {
++cleared;
}
tls_log.warn("{} cleared {} stale error(s) on queue before {}: {}", *this, cleared, operation, buf);
assert(false && "stale errors on OpenSSL error queue");

Copilot uses AI. Check for mistakes.
Comment thread src/net/ossl.cc Outdated
Comment on lines +2016 to +2020
// SSL_CTX_new can return a valid context while leaving errors on the
// error queue from partially-failed system config parsing (e.g. an
// invalid Ciphersuites value in the system openssl.cnf).
// See https://github.com/openssl/openssl/issues/30760
if (auto errors = get_all_ossl_errors(); !errors.empty()) {
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This warning attributes any queued errors after SSL_CTX_new() to that call, but the code doesn’t clear the per-thread error queue before invoking SSL_CTX_new(). If the queue was already dirty from earlier work on the same thread, this log will be misleading (and will also silently drain those unrelated errors). Consider clearing/draining the queue immediately before SSL_CTX_new() (and optionally logging if it wasn’t empty) so this warning only reflects new errors generated by SSL_CTX_new().

Copilot uses AI. Check for mistakes.
Comment thread src/net/ossl.cc
Comment on lines 1195 to 1199
// This do_until runs until either a renegotiation occurs or the packet is empty
while (!eof() && size > 0) {
size_t bytes_written = 0;
verify_clean_error_queue("SSL_write_ex");
auto write_rc = SSL_write_ex(_ssl.get(), ptr, size, &bytes_written);
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new pre-call error-queue invariant (verify_clean_error_queue before SSL_write_ex/SSL_read_ex/SSL_do_handshake/SSL_shutdown) is central to the fix, but there’s no regression test exercising the stale-error-queue scenario described in the PR. Since TLS already has unit coverage (tests/unit/tls_test.cc), consider adding a test that deliberately leaves an error on the OpenSSL per-thread queue and verifies a subsequent, unrelated TLS operation does not misclassify via SSL_get_error (and/or that the queue is drained/handled as intended).

Copilot uses AI. Check for mistakes.
@pgellert
Copy link
Copy Markdown
Author

The new verification logic is highlighting some possible error leaks (unit test failures here and redpanda ducktape test failures), so I am putting this PR into draft mode until I fix those.

@pgellert pgellert marked this pull request as draft April 10, 2026 16:07
Every other OpenSSL failure path in the session constructor uses
make_ossl_error, which drains the error queue and includes the
OpenSSL error details in the exception. This one used a plain
std::runtime_error. Align it for consistency.
Drain the error queue and include the OpenSSL error details in
the exception for the default switch cases in handle_do_put_ssl_err,
do_handshake, do_get and do_shutdown.

These default cases are not expected to be reachable — the
SSL_get_error codes they cover require callbacks, modes, or BIO
configurations that seastar does not use. This change is defensive:
if they are ever reached, the error queue is now drained and the
OpenSSL error details are included in the exception message rather
than being silently discarded.
SSL_CTX_new can return a valid context while leaving errors on the
per-thread error queue. This happens when OpenSSL's system config
parsing partially fails but the failure is masked — for example,
ssl_do_config may call SSL_CONF_cmd which pushes errors (e.g.
SSL_R_NO_CIPHER_MATCH from an invalid Ciphersuites value in the
system openssl.cnf), but ssl_do_config itself returns success when
the system flag is set and conf_diagnostics is disabled.

Introduce a clear_stale_ssl_errors helper that drains and logs any
stale errors at debug level, and use it after SSL_CTX_new succeeds.
Several SSL operations can return success while leaving errors on
the per-thread error queue. This happens when OpenSSL internally
writes through our custom BIO (e.g. to send a close_notify alert,
a TLS 1.3 NewSessionTicket, or an application data record), our
bwrite callback fails with EPIPE and returns 0, but OpenSSL's
record layer misclassifies the failure as success.

Root cause: bio_write_intern (crypto/bio/bio_lib.c) passes our
bwrite return of 0 through unchanged. BIO_write returns 0. Then
tls_retry_write_records (ssl/record/methods/tls_common.c) checks
'if (i >= 0)' and classifies BIO_write returning 0 as success.
Our bwrite returns 0 following the documented BIO_write_ex contract
(1 success, 0 failure), but BIO_write's own contract says 0 means
"BIO is NULL or dlen <= 0" — not an error. The read side had this
same class of bug, fixed on master in OpenSSL commit be42447469.

The affected operations and how they reach our bwrite callback:

- SSL_shutdown: ssl3_shutdown -> ssl3_send_alert ->
  ssl3_dispatch_alert -> write_records -> tls_retry_write_records
  -> BIO_write -> bio_write_intern -> our bwrite callback
  (sending close_notify alert)

- SSL_do_handshake: state_machine -> TLS_ST_SW_SESSION_TICKET ->
  write_records -> tls_retry_write_records -> BIO_write ->
  bio_write_intern -> our bwrite callback (sending NewSessionTicket).
  Additionally, statem_srvr.c deliberately ignores flush failures
  via conn_is_closed() for this case.

- SSL_write_ex: ssl3_write_bytes -> tls_write_records ->
  tls_retry_write_records -> BIO_write -> bio_write_intern ->
  our bwrite callback (sending application data record)

- SSL_read_ex: can internally trigger writes (e.g. TLS 1.3 key
  update responses) through the same record layer write path.
  Not yet observed in test logs but covered defensively.

This is a workaround for the missing return-value translation in
bio_write_intern and can be removed once the upstream fix is
available.
OpenSSL's SSL_get_error relies on the per-thread error queue to
classify failures. In seastar's cooperative scheduling model, multiple
TLS sessions share the same thread. If one session leaves stale errors
on the queue and then yields, another session's SSL_get_error call may
misclassify the error — e.g. reporting SSL_ERROR_SYSCALL instead of
SSL_ERROR_SSL or vice versa — because SSL_get_error peeks at the
oldest error on the queue to decide the classification (ssl_lib.c,
ossl_ssl_get_error).

We have seen error queue contamination cause issues in practice
(e1625c8 "net: avoid propagating system errors to errno", cd02ecc
"ossl: Added ERR_clear_error if disconnected post write"). The current
code is believed to be correct — all error paths drain the queue
before scheduling points — but we add these checks to aid debugging
if a regression is introduced in the future.

Add verify_clean_error_queue() checks before every SSL_do_handshake,
SSL_write_ex, SSL_read_ex and SSL_shutdown call. If stale errors are
found, the first entry is logged at warn level and an assertion fires
to surface the problem in tests.
@pgellert pgellert force-pushed the ossl/error-queue-hygiene branch from d0e2824 to e5f91e1 Compare April 15, 2026 10:35
@pgellert
Copy link
Copy Markdown
Author

force-push: I've now investigated all the failures highlighted by the new validation logic and pushed further bug fixes to clear stale errors on the thread-local error queue in various cases (see the commit messages for specifics).

I'm testing these changes through redpanda CI on redpanda-data/redpanda#30125.

@pgellert pgellert marked this pull request as ready for review April 15, 2026 10:43
@pgellert pgellert merged commit 84f6ad4 into redpanda-data:v26.1.x Apr 24, 2026
55 of 56 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants